The Problem with Hypothesis Testing

Moving Away from Testing to Estimation

Paul Johnson

Hypothesis Testing

Introducing the Method I’m Telling You to Forget

The Logic of Scientific Reasoning

  • Hypotheses are claims about the world (the population) that we can empirically test using a sample of data.
  • We cannot prove our hypothesis about the population true based on what we find in our sample – proof requires deductive reasoning, which a sample cannot provide.
  • We can only find enough evidence to reject a hypothesis (with a degree of certainty).
  • This is why hypothesis testing uses a null hypothesis and an alternative hypothesis.

Hypothesis Testing Framework

  1. Specify the null (\(H_0\)) and alternative (\(H_1\)) hypotheses, and significance level (\(\alpha\)).
  2. Generate the null distribution, given the data being analysed and the type of test that is most appropriate for this data.
  3. Compute the test statistic that describes the observed data, given the null distribution.
  4. Compute the p-value: the probability of observing a test statistic as large as, or larger than, the observed test statistic if the data were generated by chance (and the model assumptions are correct).
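The four steps above can be sketched as a simple permutation test. The data below are hypothetical, and the variable names are illustrative, but the logic follows the framework directly:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical outcome measurements for two groups.
group_a = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.9])
group_b = np.array([4.2, 4.8, 4.5, 5.0, 4.1, 4.7, 4.4, 4.9])

# Step 1: H0: no difference in group means; H1: the means differ.
# Step 3: the test statistic describing the observed data.
observed = group_a.mean() - group_b.mean()

# Step 2: generate the null distribution by shuffling group labels,
# which enforces "no group effect" while keeping the data fixed.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
null_stats = []
for _ in range(10_000):
    rng.shuffle(pooled)
    null_stats.append(pooled[:n_a].mean() - pooled[n_a:].mean())
null_stats = np.array(null_stats)

# Step 4: two-sided p-value -- the proportion of null statistics at
# least as extreme as the observed statistic.
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"observed difference = {observed:.3f}, p = {p_value:.4f}")
```

A permutation test is used here because it makes the null distribution concrete: each shuffle is one world in which the null hypothesis is true.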

All of Science is a Lie!

How We Were Getting it Wrong

The Replication Crisis

  • Attempts to replicate many published scientific findings often fail.
  • Some claim that the majority of findings are false (Ioannidis 2005), and as many as 70% of researchers report having tried and failed to reproduce another scientist’s experiments (Baker 2016).
  • I think these concerns are overstated – but trust issues and methodological flaws are real.

The Dreaded P-Value

  • The American Statistical Association (ASA) defines a p-value as “the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value” (Wasserstein and Lazar 2016).
  • Well, that’s cleared up any confusion… right?

Testing with Precision

  • We need to have confidence that tests are well specified, that the results are substantive and meaningful, and that the outcomes can be replicated.
  • Science has given too much weight to p-values.
  • Not enough attention is paid to effect sizes, statistical power, and the garden of forking paths (Gelman and Loken 2014).

Does “Statistical Significance” Matter?

  • Statistical significance is an indication of how surprised we are by what we observe in the data, given our assumption that the null hypothesis is true.
  • Does it really matter if we are surprised or not?
  • Yes! But it is not the only consideration, and it’s not even the most important consideration.

Doing Things the Right Way

“Don’t” is Still Not Enough

Estimation vs Testing

  • Stop testing null hypotheses, start estimating meaningful quantities (Poole 2022).
  • Testing null hypotheses doesn’t tell us enough, and what it can tell us might not be reliable.
  • Building statistical models that estimate effects is more robust, more reliable, and more meaningful.

Embracing Uncertainty

  • We have to accept that we are measuring effects with uncertainty, and we have to embrace the idea that our findings will include real-world variation.
  • A “more refined goal of statistical analysis” is evaluating our uncertainty about the size of an effect (Greenland et al. 2016).
  • Report effect sizes and their confidence intervals, use p-values for validation purposes.
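One way to put that recommendation into practice is to report the estimated effect with a standardised effect size and a confidence interval. This sketch uses the same kind of hypothetical two-group data; the computation is the standard pooled-variance approach:

```python
import numpy as np
from scipy import stats

# Hypothetical data for a two-group comparison.
group_a = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.9])
group_b = np.array([4.2, 4.8, 4.5, 5.0, 4.1, 4.7, 4.4, 4.9])

n_a, n_b = len(group_a), len(group_b)
diff = group_a.mean() - group_b.mean()  # the effect we care about

# Pooled standard deviation (ddof=1 gives the sample estimate).
sp = np.sqrt(((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2))

# Standardised effect size (Cohen's d).
cohens_d = diff / sp

# 95% confidence interval for the raw mean difference,
# using the t distribution with n_a + n_b - 2 degrees of freedom.
se = sp * np.sqrt(1 / n_a + 1 / n_b)
t_crit = stats.t.ppf(0.975, df=n_a + n_b - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"difference = {diff:.2f}, d = {cohens_d:.2f}, "
      f"95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

The interval communicates both the size of the effect and our uncertainty about it, which a bare p-value cannot do.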

Significant No More

  • Stop dichotomising p-values!
  • A finding is not “significant” because p < 0.05, nor is a finding “not significant” because p > 0.05.
  • Stop referring to findings as “statistically significant” (or not statistically significant).
  • Report the p-value, not an inequality.
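For example, rather than collapsing the result to “p < 0.05”, report the computed value itself. A minimal sketch with hypothetical data, using SciPy’s standard two-sample t-test:

```python
import numpy as np
from scipy import stats

# Hypothetical data for a two-group comparison.
group_a = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.3, 5.9])
group_b = np.array([4.2, 4.8, 4.5, 5.0, 4.1, 4.7, 4.4, 4.9])

result = stats.ttest_ind(group_a, group_b)

# Report the value itself, not a threshold comparison.
print(f"t({len(group_a) + len(group_b) - 2}) = {result.statistic:.2f}, "
      f"p = {result.pvalue:.4f}")
```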

Thank You!

Contact:

SCW Data Science:

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604).
Gelman, Andrew, and Eric Loken. 2014. “The Statistical Crisis in Science.” American Scientist 102 (6): 460. https://doi.org/10.1511/2014.111.460.
Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, p-Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31: 337–50.
Ioannidis, John. 2005. “Why Most Published Research Findings Are False.” PLoS Medicine 2 (8): e124.
Poole, Charles. 2022. “The Statistical Arc of Epidemiology.” Presented at the “What Is the Value of the P-Value?” panel discussion; UNC TraCS, Duke University, and Wake Forest University CTSA BERD Cores.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.